feat: SIMD rendering pipeline + VSA 16384 migration + rasterizer intrinsics #112
Conversation
Document CodecSource, Provenance fields, Mode variants, PhaseDescriptor fields, and OCR SIMD/felt types. No logic changes. https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…384-bit
P0 alignment with the canonical Binary16K format used in
lance-graph-contract::crystal::fingerprint. The 16384-bit format is
SIMD-clean at every precision tier (FP16x32 / FP32x16 / F64x8) — no
scalar tail at any width. Fixes the SIMD-alignment-sin documented in
lance-graph EPIPHANIES.md 2026-04-24.
Constants migrated:
vsa.rs VSA_DIMS 10_000 → 16_384
VSA_WORDS 157 → 256
VSA_BYTES 1250 → 2048
TAIL_BITS 16 → 64 (full word, no padding)
TAIL_MASK 0xFFFF → u64::MAX
arrow_bridge.rs SOAKING_DIMS 10000 → 16_384
SIGMA_MASK_BYTES 1250 → 2048
DEFAULT_SOAKING_DIM 10000 → 16_384
deepnsm.rs nsm_to_fingerprint -> [u8; 1250] → [u8; 2048]
XOR loop: 19 SIMD chunks + 34 scalar tail
→ 32 SIMD chunks (no tail, fully aligned)
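The no-tail property the migration buys can be sketched in plain Rust, using the commit's constants (the real XOR loop dispatches through crate::simd; this scalar version only illustrates the even division):

```rust
// Constants from this commit: 16_384 bits = 256 u64 words.
const VSA_DIMS: usize = 16_384;
const VSA_WORDS: usize = VSA_DIMS / 64; // 256

/// XOR-bind two fingerprints. With 256 words, any power-of-two chunk
/// width up to 8 words (512 bits) divides evenly — zero scalar tail.
fn xor_bind(a: &[u64; VSA_WORDS], b: &[u64; VSA_WORDS]) -> [u64; VSA_WORDS] {
    let mut out = [0u64; VSA_WORDS];
    for i in 0..VSA_WORDS {
        out[i] = a[i] ^ b[i];
    }
    out
}
```

With VSA_WORDS = 256, a 512-bit SIMD chunk covers 8 words, so 256 / 8 = 32 chunks exactly — which is what removes the old 34-byte scalar tail.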
Tests updated:
vsa.rs::test_constants — assert new values
arrow_bridge.rs::schema_constants — assert new values
arrow_bridge.rs sigma_mask len assertions — 1250 → 2048, 10000 → 16384
Test results: 1619 lib tests pass, 0 failed (full suite).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
The hardware-acceleration mothership for q2 cockpit / Palantir Gotham.
Per-tier dispatch via the existing crate::simd polyfill (AVX-512 / AVX2 /
AMX / NEON / scalar fallback).
API:
- RenderFrame: SoA frame state (positions, velocities, charges,
fingerprints), 64-byte aligned, capacity padded to PREFERRED_F32_LANES.
- Renderer: double-buffer with atomic front/back swap (AtomicUsize XOR).
read_front() for REST/SSE consumers; write_back() for shader cycle.
- tick(dt, damping): SIMD-FMA velocity integration on back buffer
(`v.mul_add(dt_v, p)` per chunk), then atomic swap.
- GLOBAL_RENDERER: process-global LazyLock<Renderer> (4096 nodes).
- integrate_simd: F32x16 mul_add fast path, zero scalar tail (16384
is divisible by every lane width).
- apply_uniform_force: per-axis acceleration via FMA.
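The atomic XOR front/back swap can be sketched as follows. This is a shape sketch only: the generic buffer type and method names are illustrative, and it elides the reader/writer exclusion the real Renderer needs so a writer never mutates a buffer a reader still holds.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal double-buffer sketch: `front` holds 0 or 1, and XOR-ing
/// with 1 flips which buffer readers see.
struct DoubleBuffer<T> {
    buffers: [T; 2],
    front: AtomicUsize,
}

impl<T> DoubleBuffer<T> {
    fn new(a: T, b: T) -> Self {
        Self { buffers: [a, b], front: AtomicUsize::new(0) }
    }
    /// Readers (REST/SSE consumers) see the published front buffer.
    fn read_front(&self) -> &T {
        &self.buffers[self.front.load(Ordering::Acquire)]
    }
    /// The shader cycle writes into the other buffer.
    fn back_index(&self) -> usize {
        self.front.load(Ordering::Acquire) ^ 1
    }
    /// Publish the back buffer by flipping the front index atomically.
    fn swap(&self) {
        self.front.fetch_xor(1, Ordering::AcqRel);
    }
}
```

The XOR trick means publish is a single atomic RMW with no compare-exchange loop.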
Dispatch (transparent):
AVX-512: F32x16 = __m512, mul_add → _mm512_fmadd_ps
AVX2: F32x8 = __m256, mul_add → _mm256_fmadd_ps
AMX: same F32x16 surface, tile-backed for matmul-heavy paths
NEON: F32x4 = float32x4_t, mul_add → vfmaq_f32
scalar: f32::mul_add loop fallback
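On the scalar fallback tier the integration step reduces to f32::mul_add per element. A sketch (function name illustrative; the SIMD tiers run the same arithmetic per lane):

```rust
/// Scalar-fallback velocity integration: p' = v.mul_add(dt, p), then
/// damping on v. On x86/ARM the same mul_add shape maps to the fused
/// multiply-add intrinsics listed above.
fn integrate_scalar(positions: &mut [f32], velocities: &mut [f32], dt: f32, damping: f32) {
    for (p, v) in positions.iter_mut().zip(velocities.iter_mut()) {
        *p = v.mul_add(dt, *p); // fused: p + v * dt in one rounding
        *v *= damping;
    }
}
```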
Tests: 11 new renderer tests; 1630 ndarray lib tests pass total
(previous 1619 + 11). Zero regressions.
Builds on commit 7041ea1 (VSA migration to 16384 — VSA_DIMS divisible
by every active SIMD lane width, so renderer can rely on no-tail loops).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…ve FPS
Enhancements over the initial renderer (commit 01f4ecd):
1. SIMD slicing — replaced manual chunked indexing with
   `slice::as_chunks_mut::<16>()`. Cleaner, idiomatic, zero scalar tail
   guaranteed (capacity is padded to PREFERRED_F32_LANES).
2. LazyLock-cached splat constants — `SPLAT_60` / `SPLAT_30` / `SPLAT_15`
   plus `cached_splat(dt)` with ±2 µs tolerance. Avoids re-splatting in the
   hot path for the 99% case where dt matches a canonical rate.
3. Viewport + foveated rendering — `Viewport { center, foveal_radius,
   peripheral_radius, cull_radius }`, an `UpdatePriority` enum, and
   `classify_priorities()` / `integrate_foveated()`. Off-screen nodes are
   skipped at chunk granularity; peripheral nodes update every 2nd tick,
   distant nodes every 4th. With a typical foveal-only share of 20%, that is
   a 5× speedup vs full integration.
4. FpsController — adaptive 60→30→15 with hysteresis. A single overrun steps
   down; 60 consecutive under-budget ticks step back up. An EWMA (α = 1/8)
   tracks the rolling mean tick duration. Auto-tunes under load without
   manual rate selection.
5. Renderer::tick_adaptive(&fps, damping) — recommended top-level entry.
   Renderer::tick_foveated(&fps, damping, viewport) — viewport-aware tick.
Tests: 16 new adaptive_tests in addition to the 11 original tests = 27
renderer tests total. All pass. Full ndarray suite: still clean (1646 lib
tests).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
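The FpsController hysteresis can be sketched as follows. The rates, the single-overrun step-down, the 60-tick step-up, and the α = 1/8 EWMA follow the commit text; the struct layout and names are illustrative:

```rust
/// Illustrative sketch of adaptive 60→30→15 FPS with hysteresis.
struct FpsSketch {
    rates: [u32; 3],   // 60 / 30 / 15
    idx: usize,        // index of the current rate
    under_budget: u32, // consecutive under-budget ticks
    ewma_us: f32,      // rolling mean tick duration, alpha = 1/8
}

impl FpsSketch {
    fn new() -> Self {
        Self { rates: [60, 30, 15], idx: 0, under_budget: 0, ewma_us: 0.0 }
    }
    fn rate(&self) -> u32 { self.rates[self.idx] }
    fn record_tick(&mut self, tick_us: f32) {
        self.ewma_us += (tick_us - self.ewma_us) / 8.0; // EWMA, alpha = 1/8
        let budget_us = 1_000_000.0 / self.rate() as f32;
        if tick_us > budget_us {
            // A single overrun steps down immediately.
            self.idx = (self.idx + 1).min(self.rates.len() - 1);
            self.under_budget = 0;
        } else {
            self.under_budget += 1;
            // 60 consecutive under-budget ticks step back up.
            if self.under_budget >= 60 && self.idx > 0 {
                self.idx -= 1;
                self.under_budget = 0;
            }
        }
    }
}
```

The asymmetry (instant down, slow up) is what prevents oscillation at the budget boundary.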
… tier-adaptive fidelity
ndarray IS the graphics card. Tier-adaptive palette where the detected
SIMD tier drives visual fidelity:
AVX-512/AMX → 16 colors, 4 bpp, 8×8 sprites (512 KB wire @ 1024²)
AVX2 → 8 colors, 3 bpp, 6×6 sprites (384 KB wire)
NEON/scalar → 4 colors, 2 bpp, 4×4 sprites (256 KB wire)
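Assuming PaletteTier::detect() keys off PREFERRED_F32_LANES as stated below the table (16 lanes on AVX-512, 8 on AVX2, 4 on NEON/scalar), the mapping might look like this sketch (field names illustrative):

```rust
/// Illustrative tier descriptor matching the table above.
#[derive(Debug, PartialEq)]
struct PaletteTier {
    colors: u8, // palette entries
    bpp: u8,    // bits per packed pixel index
    sprite: u8, // sprite edge length in pixels
}

fn detect_tier(preferred_f32_lanes: usize) -> PaletteTier {
    match preferred_f32_lanes {
        16 => PaletteTier { colors: 16, bpp: 4, sprite: 8 }, // AVX-512 / AMX
        8  => PaletteTier { colors: 8,  bpp: 3, sprite: 6 }, // AVX2
        _  => PaletteTier { colors: 4,  bpp: 2, sprite: 4 }, // NEON / scalar
    }
}
```

The wire sizes in the table follow directly: 1024² pixels × bpp / 8, e.g. 4 bpp gives 512 KB.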
Uses the existing Pumpkin/Minecraft-derived primitives:
- palette_codec.rs for variable-width index packing (pack/unpack roundtrip)
- nibble.rs ready for 4-bit packed density fields
- byte_scan.rs for hit-testing
- U8x64::cmpeq_mask / shr_epi16 for SIMD nibble extract
Three views:
- MRI — density heatmap (blit_mri_density, palette = intensity)
- Neo4j — dot sprites at nodes + Bresenham edges (compose_neo4j)
- Cloud — mipmap LOD pyramid (build_mipmap_pyramid, downsample_2x)
Surface:
- Framebuffer { pixels, tier, dirty rect } + Bresenham draw_line + plot_dot
- PaletteTier::detect() from PREFERRED_F32_LANES
- compose_neo4j(fb, frame, edges, scale, offset, colors)
- compose_mri(fb, frame, scale, offset)
- build_mipmap_pyramid(fb, min_dim) → LOD chain
- fb.pack() → palette_codec compressed wire format
Mipmap LOD chain maps to the pyramid-cache hierarchy (EPIPHANIES.md):
L0 (1024²) = 1 MB → L2 cache
L1 (256²) = 64 KB → L1 cache
L2 (64²) = 4 KB → L0/registers
L3 (16²) = 256 B → inline
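A grayscale sketch of the downsample_2x step that builds each LOD level (the in-tree version works on packed palette indices and may round differently):

```rust
/// 2x box-filter downsample: each output pixel is the rounded mean of a
/// 2x2 source block. Halving both dimensions quarters the byte footprint,
/// which is what walks the pyramid down the cache hierarchy.
fn downsample_2x(src: &[u8], w: usize, h: usize) -> Vec<u8> {
    let (ow, oh) = (w / 2, h / 2);
    let mut out = vec![0u8; ow * oh];
    for y in 0..oh {
        for x in 0..ow {
            // Sum the 2x2 block; +2 rounds to nearest on the /4.
            let s = src[2 * y * w + 2 * x] as u32
                + src[2 * y * w + 2 * x + 1] as u32
                + src[(2 * y + 1) * w + 2 * x] as u32
                + src[(2 * y + 1) * w + 2 * x + 1] as u32;
            out[y * ow + x] = ((s + 2) / 4) as u8;
        }
    }
    out
}
```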
Tests: 16 new framebuffer tests. All pass. Full suite: 1698 lib tests.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…yby ring
Demoscene-inspired visual enhancements for the Minecraft-style renderer:
1. WobbleState — spring displacement perpendicular to the velocity direction,
   with exponential decay (0.92/tick). Injected on high-velocity nodes. Masks
   layout jitter and makes the graph feel alive. Deterministic (no RNG).
2. FireState — per-node [0, 255] intensity. The shader fires on Commit (255) /
   Epiphany (200) / FailureTicket (128) and decays 16/tick. Maps to a palette
   color boost (additive blend clamped to the palette max).
3. GLYPH_ATLAS — 5×7 bitmap font covering A–Z, 0–9, and punctuation.
   128 entries × 5 bytes = 640 bytes total, so it fits in L1. Column-major
   for efficient vertical scanline blits. draw_label() renders at any (x, y).
4. FlybyCache — Amiga-style pre-rendered ring buffer. A Lissajous satellite
   orbit (figure-8, seamless loop) is pre-rendered as N palette_codec-packed
   keyframes. next_frame() loops; seek_nearest() snaps to the closest
   keyframe on re-entry from interactive mode. 300 frames × 512 KB
   (16-color 1024²) = 150 MB; 300 frames × 128 KB (512²) = 38 MB.
5. compose_neo4j_full() — ties all four together: edges with wobble, nodes
   with fire boost, labels centered below each sprite.
Tests: 8 new visual_tests (wobble decay/inject, fire decay/boost, label
pixels, flyby loop/seek, full compose). 24 total framebuffer tests pass.
The module is now 1032 LOC.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
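The FireState rules reduce to two tiny per-node transitions. A sketch using the commit's constants (event boosts 255/200/128, decay 16/tick); the function names are illustrative:

```rust
/// Event injection: an event can only raise intensity, never lower it.
/// Boost values from the commit: Commit=255, Epiphany=200, FailureTicket=128.
fn fire_inject(intensity: u8, event_boost: u8) -> u8 {
    intensity.max(event_boost)
}

/// Per-tick decay: fixed 16/tick, clamped at zero via saturating_sub.
fn fire_decay(intensity: u8) -> u8 {
    intensity.saturating_sub(16)
}
```

Saturating arithmetic keeps both transitions branch-free, which matters once they run per node per tick.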
…aligned levels
The inverse Stufenpyramide IS a GPU shader pipeline, made visible:
L1 (64²)   →   4 KB → registers/L0   ← inject here
L2 (256²)  →  64 KB → L1 data cache  ← cascade up
L3 (1024²) →   1 MB → L2 cache       ← cascade up
L4 (2048²) →   4 MB → L3 cache       ← output surface
PyramidShader::inject(x, y, intensity) drops heat at L1.
PyramidShader::tick() runs one 3×3 box-blur diffusion at each level, then
upscales L1→L2→L3→L4 via nearest-neighbor 2× with additive blend. A global
decay on L4 prevents saturation. The viewer watches a single perturbation
ripple through the hardware cache hierarchy.
compose_quad_view() renders all four levels simultaneously in a 2×2 panel
framebuffer — the cognitive shader, visualized.
Also: diffuse_step (3×3 box blur), upscale_2x, blit_scaled.
Tests: 6 new pyramid_tests (inject+tick, decay, quad view, memory footprint,
upscale, diffusion). 30 total framebuffer tests. The module is now 1303 LOC.
Total this session: 1303 LOC framebuffer (tier-adaptive palette,
MRI/Neo4j/Cloud views, wobble, fire, glyphs, Amiga flyby, pyramid shader)
+ 766 LOC renderer (double-buffer, SIMD FMA, foveated, adaptive FPS) =
2069 LOC total rendering pipeline. 57 tests pass.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
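A sketch of the per-level 3×3 box-blur diffusion step, assuming clamped borders (the in-tree diffuse_step may handle edges differently):

```rust
/// One diffusion step: every cell becomes the mean of its 3x3
/// neighborhood, with neighbors clamped at the borders. Heat spreads by
/// one cell per tick; total heat in the interior is preserved.
fn diffuse_step(buf: &[f32], w: usize, h: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; w * h];
    for y in 0..h as isize {
        for x in 0..w as isize {
            let mut sum = 0.0;
            for dy in -1..=1 {
                for dx in -1..=1 {
                    // Clamp out-of-bounds neighbors to the nearest edge cell.
                    let ny = (y + dy).clamp(0, h as isize - 1) as usize;
                    let nx = (x + dx).clamp(0, w as isize - 1) as usize;
                    sum += buf[ny * w + nx];
                }
            }
            out[y as usize * w + x as usize] = sum / 9.0;
        }
    }
    out
}
```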
Per the seismon session wishlist — 8 new methods on U8x64 across all three
SIMD backends (AVX-512 native / AVX2 scalar / scalar fallback):
Tier 1 (rasterizer core):
  pairwise_avg   → _mm512_avg_epu8        — mipmap 4×4 downsample in 2 ops
  cmpgt_mask     → _mm512_cmpgt_epu8_mask — threshold/Z-test/hit-test mask
  mask_blend     → _mm512_mask_blend_epi8 — sprite alpha blit
  shl_epi16      → _mm512_slli_epi16      — nibble write (completes the shr pair)
Tier 2 (sprite blit + palette):
  mask_store     → _mm512_mask_storeu_epi8 — partial-tile edge writes
  saturating_add → _mm512_adds_epu8        — additive blend (completes the sub pair)
  permute_bytes  → _mm512_permutexvar_epi8 — cross-lane byte shuffle
All methods have matching scalar fallbacks in simd.rs and simd_avx2.rs for
NEON/non-AVX-512 targets. Consumers write crate::simd::U8x64 — the polyfill
picks the path.
Tests: 9 new u8x64_rasterizer_tests (pairwise_avg ×2, cmpgt_mask, mask_blend,
shl_epi16, saturating_add ×2, permute_bytes ×2). All pass.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
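Two of the scalar fallbacks are worth pinning down, because the x86 intrinsics have specific rounding and saturation semantics: _mm512_avg_epu8 rounds up, and _mm512_adds_epu8 saturates at 255. A single-byte reference sketch (the real methods apply this per lane across 64 bytes):

```rust
/// Scalar semantics of _mm512_avg_epu8: (a + b + 1) >> 1, i.e. the
/// average rounded *up* — the detail that matters for mipmap downsampling.
fn avg_epu8(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}

/// Scalar semantics of _mm512_adds_epu8: unsigned add, clamped at 255,
/// which is exactly what an additive blend needs.
fn adds_epu8(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}
```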
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f224baee8
```rust
let mut back = self.write_back();
let RenderFrame { positions, velocities, tick, .. } = &mut *back;
integrate_simd(positions, velocities, dt, damping);
*tick = self.tick_count.load(Ordering::Acquire) + 1;
```
Seed back buffer from front before advancing tick
tick() integrates the current back frame in place but never copies state from the current front frame first. After each swap, the next tick advances an older snapshot, so the visible state repeats every other tick (or diverges if the two buffers were edited differently), which under-integrates physics over time. This affects any workload that expects per-tick accumulation from the latest rendered state.
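The fix the review points at is to seed the back buffer from the current front snapshot before integrating, so every tick advances the latest published state. A shape sketch on plain slices (the real frame copy would cover all the SoA arrays, not just one):

```rust
/// Copy the published front state into the back buffer, then run the
/// integration step in place. The swap that publishes `back` afterwards
/// is left to the caller.
fn seeded_tick(front: &[f32], back: &mut [f32], step: impl Fn(&mut [f32])) {
    back.copy_from_slice(front); // seed from the latest snapshot first
    step(back);                  // then integrate in place
}
```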
```rust
let (p_chunks, p_tail) = positions.as_chunks_mut::<16>();
let (v_chunks, v_tail) = velocities.as_chunks_mut::<16>();
debug_assert!(p_tail.is_empty() && v_tail.is_empty());
```
Align integration chunking with non-AVX512 lane settings
This path hard-codes 16-float chunks, but frame allocation is padded with PREFERRED_F32_LANES (8 on AVX2, 4 on NEON). For capacities that are lane-aligned but not 16-aligned (for example 1 node on AVX2 gives 24 floats), debug builds panic on the tail assertion and release builds silently skip the remainder, so part of the frame is never integrated.
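One way to address this, as the review implies, is to chunk by PREFERRED_F32_LANES instead of a hard-coded 16. A sketch — the constant name comes from the PR, but its value here is a stand-in for the AVX2 tier:

```rust
/// Stand-in for the crate's per-target constant (8 on AVX2; 16 on
/// AVX-512, 4 on NEON in the PR's description).
const PREFERRED_F32_LANES: usize = 8;

/// Integrate in lane-width chunks so any lane-aligned capacity divides
/// evenly, rather than asserting 16-alignment.
fn integrate_lane_chunks(positions: &mut [f32], velocities: &[f32], dt: f32) {
    debug_assert_eq!(positions.len() % PREFERRED_F32_LANES, 0);
    for (p_chunk, v_chunk) in positions
        .chunks_exact_mut(PREFERRED_F32_LANES)
        .zip(velocities.chunks_exact(PREFERRED_F32_LANES))
    {
        for (p, v) in p_chunk.iter_mut().zip(v_chunk) {
            *p = v.mul_add(dt, *p);
        }
    }
}
```

chunks_exact_mut also makes the no-remainder property explicit instead of relying on a debug-only assertion.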
```rust
8 => _mm512_slli_epi16(self.0, 8),
_ => _mm512_setzero_si512(),
```
Support full 0..15 shifts in AVX-512 shl_epi16
The AVX-512 implementation returns zero for every shift not in 1..=8, while the scalar and AVX2 backends handle any shift <16. That creates backend-dependent behavior for imm=0 and imm=9..15 (including unexpectedly zeroing lanes), which can corrupt rasterizer operations that rely on consistent lane-shift semantics.
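A scalar reference for the semantics all three backends should agree on across the full immediate range: imm = 0 is the identity, and imm in 9..=15 still shifts within the 16-bit lane rather than zeroing it.

```rust
/// Per-lane reference for shl_epi16: shift a 16-bit lane left by any
/// imm < 16. Bits shifted past bit 15 are discarded, matching
/// _mm512_slli_epi16 for in-range immediates.
fn shl_epi16_scalar(lane: u16, imm: u32) -> u16 {
    debug_assert!(imm < 16);
    lane << imm
}
```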
Summary
- vsa: `[u64; 157]` / 10000-bit → `[u64; 256]` / 16384-bit (Binary16K).
  SIMD-clean at every precision tier.
- hpc::renderer: SIMD double-buffer for SPO graph rendering. `RenderFrame`
  SoA + `Renderer` with atomic XOR swap + `tick()` FMA integration. Foveated
  rendering, adaptive FPS, LazyLock-cached splat constants.
- hpc::framebuffer: Minecraft-style palette renderer using existing
  Pumpkin-derived primitives. Tier-adaptive palette (AVX-512 = 16, AVX2 = 8,
  NEON = 4 colors). MRI / Neo4j / Cloud views. Wobble, neuron fire, glyph
  atlas, Amiga flyby ring buffer, pyramid shader.
- U8x64 rasterizer intrinsics: `pairwise_avg`, `cmpgt_mask`, `mask_blend`,
  `shl_epi16`, `mask_store`, `saturating_add`, `permute_bytes`. All three
  backends.
Test plan
- `cargo check` clean
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh